This repository was archived by the owner on Feb 18, 2024. It is now read-only.

Proposal #1

Closed
jorgecarleitao wants to merge 71 commits into main from proposal

Conversation

@jorgecarleitao
Owner

@jorgecarleitao jorgecarleitao commented Feb 7, 2021

See README.md

@jorgecarleitao jorgecarleitao force-pushed the proposal branch 2 times, most recently from 57a129d to d16c779 Compare February 7, 2021 13:20
@elferherrera
Contributor

Wow, I just spent the past hour going through your proposal and it looks really great. I like that you have made the whole memory implementation easier to understand. The code reads very rusty. I also liked the fact that most things are documented and well explained. I think it does make sense to think about implementing this as the base for arrow, especially given the performance improvements you are reporting.

Regarding the NativeType trait and its implementations, I couldn't understand why it has to be unsafe. Do you mind explaining that to me?

@jorgecarleitao
Owner Author

jorgecarleitao commented Feb 11, 2021

I am thinking that while this is a major change to the implementation of Arrow, most of the APIs changed are what I would think of as implementation details.

The main difference happens in interactions with lower-level functionality, yes. But AFAIK folks at UrbanLogic (@maxburke) use them.

For higher-level functionality, the main change is that creating a primitive array from an iterator takes a few more characters:

```rust
let array = iter.map(...).collect::<Primitive<i32>>().to(DataType::Date32);
```

vs

```rust
let array = iter.map(...).collect::<Date32Array>();
```

This derives from the split between the logical and physical parts of the array, so that we can e.g. create timezone-aware timestamps:

```rust
let array = iter.map(...).collect::<Primitive<i64>>().to(DataType::Timestamp(a, b));
```

which is not possible in current arrow without casting the array or using ArrayData directly.

What do you have in mind for next steps?

  1. Evaluate whether this covers all the necessary aspects for it to fly. I am adding some kernels on top of this new format to validate UX and coverage of requirements.

  2. Replace Buffer by Buffer<u8>. This has no semantic change but allows us to prepare the code for typed buffers.

  3. Create all the traits required to read all attributes of all arrays, and implement them for all arrays.

  4. Replace Array::data_ref/data by the new traits. The idea here is that we will continue to create arrays using ArrayData, but we no longer use it to access data, relying instead on type-specific methods. This requires a refactor of the transform and equal modules, as well as the ffi export, IPC writer, and parquet writer (i.e. everything that is type-agnostic and relies on ArrayData).

  5. Change array creation to not rely on ArrayData. After 4, code no longer relies on reads of ArrayData, and thus we can drop it entirely. This requires major changes to the Builder API, as well as to anything that creates arrays, such as the ffi import, IPC reader, parquet reader, etc.

As you can see, this really has a major impact on the crate as a whole, which is why it is so hard to implement.

@jorgecarleitao
Owner Author

Regarding the NativeType trait and its implementations, I couldn't understand why it has to be unsafe. Do you mind explaining that to me?

The arrow specification assumes that a buffer can only be of certain types, and requires specific memory alignments for these. We use that to safely transfer data over FFI boundaries, write to parquet, IPC, etc. Furthermore, our allocator has an optimization in which we assume that NativeType implementations do not contain any pointers.

Thus,

```rust
struct A(HashMap<i32, i32>);

impl NativeType for A {}
```

would leak and would also result in undefined behavior. Thus, we mark the trait as unsafe to tell everyone: "look, do not implement this trait without careful consideration of what you are doing with it: the arrow crate is not prepared to handle arbitrary structs". In the arrow crate it is not marked as such, but it should be.
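To make the contract concrete, here is a minimal sketch of what such an `unsafe` marker trait can look like; the trait name, bounds, and function below are illustrative, not this crate's actual API:

```rust
// `unsafe trait` signals that the *implementer*, not the compiler, guarantees
// the invariants: fixed size, known alignment, and no interior pointers.
pub unsafe trait PlainNative: Copy + Sized + 'static {}

// Plain fixed-size scalars uphold those invariants, so these impls are sound:
unsafe impl PlainNative for i32 {}
unsafe impl PlainNative for f64 {}

// Generic code can now rely on the invariants, e.g. to compute byte widths
// when laying values out in an arrow-style buffer:
pub fn byte_width<T: PlainNative>() -> usize {
    std::mem::size_of::<T>()
}
```

An `unsafe impl PlainNative for A {}` where `A` holds a `HashMap` would still compile, which is exactly why the `unsafe` keyword shifts the responsibility onto whoever writes that impl.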

@elferherrera
Contributor

elferherrera commented Feb 11, 2021

would leak and would also result in undefined behavior. Thus, we mark the trait as unsafe to tell everyone "look, do not
implement this trait without careful consideration about what you are doing with it: the arrow crate is not prepared to handle arbitrary structs". In the arrow crate it is not marked as such but it should be.

Gotcha. So it is a warning to someone wanting to implement the trait for another type. Yesterday I was playing with your code and saw that I could remove the unsafe keyword from the trait and it would still compile.

Out of curiosity, would it not make sense to keep the trait hidden (not pub) instead of using unsafe?

@jorgecarleitao
Owner Author

So it is a warning to someone wanting to implement the trait for another type. Yesterday I was playing with your code and saw that I could remove the unsafe keyword from the trait and it would still compile.

Yes, that is the main use-case of unsafe. Adding or removing unsafe does not affect the produced binary; it is a keyword only used to flag errors. It works a bit like the std::marker traits.

Out of curiosity, would it not make sense to keep the trait hidden (not pub) instead of using unsafe?

I agree with this sentiment. However, that would not allow people to write generics that depend on it: it is not possible in Rust to publicly expose generic functions whose type parameters are not public, as that would "leak" a private trait via a public function. There are some discussions about allowing a trait to be "sealed", which I think would solve this problem, but they are still in the RFC phase.
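The "sealed" workaround that exists today uses a private-module supertrait. A minimal sketch, with illustrative names rather than this crate's actual API:

```rust
mod private {
    // Nominally `pub`, but unreachable from outside this crate because the
    // module itself is private -- so downstream crates cannot implement it.
    pub trait Sealed {}
    impl Sealed for i32 {}
    impl Sealed for i64 {}
}

// Public trait that anyone can use in bounds, but only this crate can
// implement, because it requires the unreachable `Sealed` supertrait.
pub trait NativeType: private::Sealed + Copy {}
impl NativeType for i32 {}
impl NativeType for i64 {}

// Downstream code can still write generics over the sealed trait:
pub fn width<T: NativeType>() -> usize {
    std::mem::size_of::<T>()
}
```

This keeps the trait usable in public signatures while preventing arbitrary impls, which is the combination a plain non-`pub` trait cannot provide.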

@elferherrera
Contributor

I guess sealed traits could be implemented like in this example:
https://rustype.github.io/notes/notes/rust-typestate-series/rust-typestate-part-1.html#stricter-states

@ritchie46
Collaborator

Very interesting read, and it really scratches an itch that I experienced when starting to use Arrow but got used to over time due to exposure. What is the main idea: fork and continue as arrow2, or make an MVP and hope to get that merged into the main project?

@alamb
Collaborator

alamb commented Feb 13, 2021

I hope we don't end up with a fork. While it will be painful, I think breaking this PR up into pieces and bringing it incrementally into the main arrow codebase would be the most plausible way to bring the idea to fruition.

@jorgecarleitao
Owner Author

I also hope so, and I am working towards a plan to have this merged into the main repo.

Technically, the core hypothesis that I am testing atm is that the way this repo handles offsets works in practice, as validated by at least the IPC integration tests. I am really uncertain here.

This PR uses a different approach to offsets: they are no longer tracked only on ArrayData; instead, each Buffer/Bitmap tracks its own, the same way tokio's bytes crate does it. This is dramatically easier to work with, but I am unsure whether it will work with IPC and FFI. If it does not work, I may have to revisit this whole thing. I am working towards having the json IO migrated to this repo, as the integration tests for IPC use json.
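The idea of a buffer that owns its offset can be sketched as follows; this is a simplified illustration in the spirit of the bytes crate, with field and method names that are hypothetical, not this repo's actual types:

```rust
use std::sync::Arc;

// A view into shared, immutable storage. Slicing moves only `offset` and
// `length`; the backing allocation is never copied.
#[derive(Clone)]
pub struct Buffer<T> {
    data: Arc<Vec<T>>, // shared backing storage
    offset: usize,     // where this view starts in `data`
    length: usize,     // number of elements this view covers
}

impl<T> Buffer<T> {
    pub fn new(values: Vec<T>) -> Self {
        let length = values.len();
        Self { data: Arc::new(values), offset: 0, length }
    }

    // Zero-copy slicing: the Arc is cloned, the data is not.
    pub fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(offset + length <= self.length, "slice out of bounds");
        Self { data: self.data.clone(), offset: self.offset + offset, length }
    }

    pub fn as_slice(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.length]
    }
}
```

Because every buffer carries its own offset, consumers never need a separate ArrayData-level offset to interpret it, which is the simplification being tested against IPC and FFI here.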

@elferherrera
Contributor

Another thing that could help to get traction behind your idea is to have more performance comparisons between the current Arrow and the new implementation. I know you have put a lot of time into this, so if you would like help testing something, let me know and I can help with that. A list of the things that you want to test could be useful so we can work on that.

Sorry to piggyback on this thread to ask something related to your NativeType trait implementation, but would using a private module (example) have the same effect you are looking for when sealing the traits?

@jorgecarleitao
Owner Author

jorgecarleitao commented Feb 28, 2021

Sorry for the late reply, but the last few weeks have been pretty busy. However, I did have time to work on this.

I have now concluded the feasibility study that I wanted to do on this.

  • I was able to make all IPC integration tests pass for both file and stream, for both little endian and big endian (which we do not even support in master atm).
  • I was unable to run parquet read integration tests because we have none: we only test that we can read the files, not that the result is correct.
    • I used pyarrow to get expected values for some of the parquet files we have available, but AFAIK they do not contain nulls, so I was unable to test the validity handling. This approach is also not scalable.
    • I was able to make this fork parse primitive types from parquet files.
    • I have not tested writes, mostly because we have no baseline to compare against.
    • I was unable to correctly read variable-sized arrays from parquet. I am not sure if this is due to my changes or an issue already present on master (AFAIK we do not test that).

I thus conclude that the biggest risk for this endeavor is regressions on the parquet IO.

Recommended actions:

  • Park this until we have sufficient coverage for reading and writing parquet files to perform a migration.
  • Raise the possibility in the mailing list of creating golden parquet files and corresponding .arrow files (e.g. read the parquet from the c++ implementation and write .arrow), so that Rust (and others) can read them and compare against the respective .arrow to verify that the in-memory matches. This will give us IO parity across implementations, which in our case is necessary to validate that the readers and writers are consistent across languages.

@alamb
Collaborator

alamb commented Feb 28, 2021

golden parquet files and corresponding .arrow

I think this is an excellent idea.

@jorgecarleitao
Owner Author

I am closing this PR, as I will start the work of making this repo usable.

The ideas above hold, but I plan to use this repo to host a transmute-free implementation of arrow.

@jorgecarleitao
Owner Author

The repo now contains the design and implementation that I have been baking over the past months. Some modules have a README with the actual design notes of them (i.e. MUST, MAY, SHOULD, etc).

The repo has most things implemented with the notable exception of parquet IO, which I am still trying to grasp.

I feature-gated almost everything so that the crate depends on 3 small dependencies and chrono.

Many, many components were re-written from scratch because they left me no other choice. I also deprecated the "Builder" API, as it is entirely replaced by a FromIter equivalent that supports arbitrary (typed) nesting.
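The FromIterator-style replacement for the Builder API can be sketched roughly as follows; the struct and field names are illustrative (a real implementation would use a packed validity bitmap, not `Vec<bool>`):

```rust
// A simplified primitive array: values plus per-slot validity.
pub struct Primitive<T> {
    values: Vec<T>,
    validity: Vec<bool>, // a packed bitmap in a real implementation
}

// Collecting from an iterator of Option<T> replaces explicit builder calls:
// Some(v) appends a valid value, None appends a null slot.
impl<T: Default> FromIterator<Option<T>> for Primitive<T> {
    fn from_iter<I: IntoIterator<Item = Option<T>>>(iter: I) -> Self {
        let mut values = Vec::new();
        let mut validity = Vec::new();
        for item in iter {
            validity.push(item.is_some());
            values.push(item.unwrap_or_default()); // placeholder value for nulls
        }
        Self { values, validity }
    }
}
```

Usage then reads as ordinary iterator code, e.g. `let a: Primitive<i32> = vec![Some(1), None, Some(3)].into_iter().collect();`, and the same pattern nests for list and struct arrays.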

The specification is always validated when an array is created and there is little room for unsoundness.

@alamb
Collaborator

alamb commented Mar 8, 2021

😮 - so now the next question is, "what next and what can we do to help @jorgecarleitao "?

@jorgecarleitao jorgecarleitao deleted the proposal branch August 22, 2021 21:00